Abstract: What is the relationship between plausibility logic and the principle
of maximum entropy? When does the principle give unreasonable or wrong
results? When is it appropriate to use the rule 'expectation = average'? Can
plausibility logic give the same answers as the principle, and better answers
if those of the principle are unreasonable? To try to answer these questions,
this study offers a numerical collection of plausibility distributions given by the
maximum-entropy principle and by plausibility logic for a set of fifteen simple
problems: throwing dice.
1 When and how should the maximum-entropy principle be applied?



For the student of plausibility logic¹, the theory of the principles governing plausible
inference, the application of the theory in any given problem is crystal clear in
principle: (1) The problem is analysed and reduced to a set of propositions $\{A_i\}$ and
background knowledge $I$. (2) Some plausibilities
$P[A_{i_1} \dots A_{i_2} \mid (A_{i_3} \dots A_{i_4}) \land I] \in [0,1]$ are
assigned, consistently with the laws below, according to our actual or hypothetical
knowledge of the situation and to convenience; the '$A_i \dots A_j$' represent collections
of $A_i$s joined by various logical connectives ('¬', '∧', '∨', '⇒'). (3) Finally, using
the basic laws



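$$
\begin{aligned}
P(\lnot A_i \mid I) &= 1 - P(A_i \mid I), && \text{(1a)}\\
P(A_i \land A_j \mid I) &= P(A_i \mid A_j \land I)\, P(A_j \mid I), && \text{(1b)}\\
P(A_i \lor A_j \mid I) &= P(A_i \mid I) + P(A_j \mid I) - P(A_i \land A_j \mid I), && \text{(1c)}\\
P(A_i \Rightarrow A_j \mid I) &= P(\lnot A_i \mid I) + P(A_j \mid A_i \land I)\, P(A_i \mid I), && \text{(1d)}
\end{aligned}
$$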



Email: lmana@zperimeterinstitute.ca (remove the z)

¹ I call 'plausibility logic' what many other authors call '(Bayesian) probability theory'. 'Logic',
because it is a generalization of the truth-logical calculus. 'Plausibility', because 'degree of belief' is
unfortunately too unwieldy and many authors still contend that 'probability' = 'frequency' or, perhaps
worse, 'probability' = '(Lebesgue) measure'.
the values of unassigned plausibilities of interest, or at least some bounds for them,
are calculated, e.g. via fractional linear programming, as explained in the scholarly
and unfortunately much neglected works of Hailperin [1; ?, §§ 0.4, 4.5, 5.4, 6.2 and
passim]. If required, some mathematical limits are taken. This procedure applies
whether we want to calculate the plausibility of a proposition of interest, to explore
the plausible consequences of a hypothesis, or to see the relationship between the
plausibilities assigned in different contexts; and to other similar problems.
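As a small illustration of step (3), the sketch below checks the four basic laws (1a)-(1d) on a toy assignment. The joint plausibilities in `joint` are hypothetical numbers chosen only for illustration, the helper `plaus` is not part of the paper, and the implication is read as the material conditional 'not $A_i$ or $A_j$'.

```python
from fractions import Fraction

# Hypothetical joint plausibilities for the truth values of (A_i, A_j),
# all conditional on the background knowledge I. Illustrative numbers only.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(2, 10),
    (False, True): Fraction(1, 10),
    (False, False): Fraction(4, 10),
}

def plaus(event, given=lambda ai, aj: True):
    """P(event | given and I), obtained by summing entries of the joint table."""
    num = sum(p for (ai, aj), p in joint.items() if given(ai, aj) and event(ai, aj))
    den = sum(p for (ai, aj), p in joint.items() if given(ai, aj))
    return num / den

A_i = lambda ai, aj: ai
A_j = lambda ai, aj: aj
not_A_i = lambda ai, aj: not ai
both = lambda ai, aj: ai and aj
either = lambda ai, aj: ai or aj
implies = lambda ai, aj: (not ai) or aj  # material conditional

law_1a = plaus(not_A_i) == 1 - plaus(A_i)
law_1b = plaus(both) == plaus(A_i, given=A_j) * plaus(A_j)
law_1c = plaus(either) == plaus(A_i) + plaus(A_j) - plaus(both)
law_1d = plaus(implies) == plaus(not_A_i) + plaus(A_j, given=A_i) * plaus(A_i)

print(law_1a, law_1b, law_1c, law_1d)
```

Exact rational arithmetic is used so that the equalities can be tested without rounding tolerances.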

When the same student approaches the principle of maximum entropy, he is not 
welcomed by a comparable level of clarity. For example, he is told to use data con- 
sisting in an average value; but does the number of observations contributing to this 
average matter? He is told to equate an average with an expectation; but when and 
why is it reasonable to equate these very different quantities? He is told that the prin- 
ciple yields a plausibility distribution; but does this plausibility concern one of the 
observations contributing to the data, or a new observation? And what is the
conditional of the probability distribution given by the principle? That is, if the principle
gives a $P(A_i \mid \cdot)$, what proposition does the dot stand for? In fact, the specification of
the conditional, or 'supposal' or 'context', of a plausibility is extremely
important, also because the conditionals and the arguments of two plausibilities must match
in a precise way for the plausibility laws to be used. It is meaningless to multiply,
e.g., $P(A_3 \mid I)$ and $P(A_2 \mid A_1 \land I)$ without further qualifications.

The various 'proofs' of the maximum-entropy principle in the literature do not
help the student very much either. The most rigorous of them cover only specific
situations, leaving out important ones. For example, van Campenhout & Cover [?]
and Csiszár [9] prove that a uniform i.i.d. model gives, to any observation
contributing to the average, the same distribution as the maximum-entropy principle, when
the average used as data comes from a very large number of observations. But they
are silent on whether that distribution is valid for similar observations not
contributing to the average. On the other hand, the logic behind many proofs covering this
last application of the principle has been repeatedly attacked by various authors. I
recommend Skyrms' [10] and especially Uffink's [11] analyses, which give insights
on several important points (but show confusion about others), many of which are
repeated here.

One could even say that there is not just one 'maximum-entropy principle',
because in the literature this term denotes qualitatively different procedures and
problems, in which one seeks qualitatively different kinds of distributions: e.g.,
distributions of probability, of frequency, or of intensity as in image-reconstruction
problems. In some of these cases there really is no 'principle' but only a 'rule' which
appears asymptotically from choosing the frequency- or intensity-distribution $M_i/M$,
$i = 1, \dots, r$, that can be realized in the greatest number of ways. The number of ways
is usually given by the multiplicity factor

$$\frac{M!}{\prod_i M_i!} = \epsilon(M_i)\, \exp[M H(M_i/M)], \quad \text{with} \quad (M+1)^{-r} \le \epsilon(M_i) \le 1, \tag{2}$$

and this is asymptotically equal [12, § 1.2; ?, § 2.1] to the exponential of a very
large multiple of the Shannon entropy of the distribution,

$$H(f_i) := -\sum_i f_i \ln f_i, \qquad f_i \ge 0, \quad \sum_i f_i = 1; \qquad 0 \ln 0 := 0. \tag{3}$$

The distribution chosen by 'counting reasons' is therefore the one having maximum 
Shannon entropy. In examples like these the maximum-entropy rule is an obvious 
consequence of plausibility logic with an assumption of symmetry on a particular 
hypothesis space (e.g., the set of outcome sequences) and does not need additional 
principles for its justification. These cases do not concern us here. But the counting 
arguments above make no sense for distributions of plausibility, and the application 
of the rule, especially to new observations, seems to require additional principles 
besides those of plausibility logic. 
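The counting argument can be checked numerically. The sketch below computes the multiplicity factor $M!/\prod_i M_i!$ exactly and compares it with $\exp[M H(M_i/M)]$, testing the bounds $(M+1)^{-r} \le \epsilon \le 1$ on the prefactor. The counts in `base` are hypothetical, and the function names are introduced here only for illustration.

```python
from math import exp, factorial, log, prod

def shannon_entropy(freqs):
    """H(f) = -sum_i f_i ln f_i, with the convention 0 ln 0 = 0."""
    return -sum(f * log(f) for f in freqs if f > 0)

def multiplicity(counts):
    """Number of outcome sequences realizing the counts: M! / prod_i M_i!."""
    return factorial(sum(counts)) // prod(factorial(m) for m in counts)

def epsilon(counts):
    """Prefactor relating the multiplicity to exp[M H(M_i / M)]."""
    M = sum(counts)
    H = shannon_entropy([m / M for m in counts])
    return multiplicity(counts) / exp(M * H)

# Hypothetical counts for r = 6 outcomes; M grows while the frequencies stay fixed.
base = (1, 1, 2, 3, 5, 8)
H_base = shannon_entropy([m / sum(base) for m in base])
for scale in (1, 5, 20):
    counts = tuple(scale * m for m in base)
    M, r = sum(counts), len(counts)
    within_bounds = (M + 1) ** (-r) <= epsilon(counts) <= 1
    # ln(multiplicity) / M approaches the Shannon entropy as M grows.
    print(M, within_bounds, log(multiplicity(counts)) / M, H_base)
```

The third printed column approaches the fourth as `scale` increases, which is the asymptotic statement made in the text.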

2 A numerical comparison: fifteen problems 

The purpose of the present work is to examine the maximum-entropy principle 'in 
action' in a collection of simple problems, to see under which circumstances its re- 
sults are intuitively reasonable or not and how these compare with those given by 
plausibility logic alone. 

The collection of problems is the following: Of N throws of a die we know the 
average a and nothing else; not even the throwing technique or the kind of die used, 
although we assume them to be the same or at least very similar in all throws. We 
want the plausibility distributions for: 

1. the outcomes of one of the N throws contributing to the average, which we call 
'old throws'; and 

2. the outcomes of a throw — of the same or very similar die and with the same 
throwing technique — outside the set of N old throws, which we call a 'new 
throw'. 

We consider the problems obtained by combining the particular values N = 1, 2, 6, 12
and N large, together with a = 6, 5, and 7/2, for a total of fifteen problems.

The adjectives 'old' and 'new' qualifying the throws have really no temporal 
meaning; in fact, the problem is by assumption completely symmetric with respect 
to the temporal ordering of all throws, or the way they are scattered in space-time. A 
'new' throw could precede an 'old' one, or they could all happen at the same time. 
To stress this I shall use the present tense even with 'old' throws, saying e.g. 'the
second old throw gives face ⚅'. Plausibility logic is affected by what we know; it is
immaterial how and when we come to know it (even if it had been by precognition or 
other imaginary ways). 

The old throws are numbered from 1 to N (this numbering, again, bearing no
temporal meaning); a new throw is assigned the number 0. The proposition stating
that the outcome of throw j is the face i of the die is denoted by $R_i^{(j)}$. The proposition
stating that the average of the N throws is a is denoted by $A_a^N$. Thus the plausibility
distributions sought are, in symbols,

1. $P(R_i^{(1)} \mid A_a^N \land I)$ (old throw, j = 1) and

2. $P(R_i^{(0)} \mid A_a^N \land I)$ (new throw, j = 0).

The proposition $I$ states our background knowledge, i.e., mathematically speaking,
our plausibility model. An important question indeed is: what kind of background
knowledge, or plausibility model, are we implicitly assuming when we use
the maximum-entropy principle?



2.1 Exchangeable plausibility models used for comparison 

In our comparison we consider three plausibility models, all three infinitely
exchangeable. Infinite exchangeability means that, in assigning a plausibility
distribution to the set of outcomes of any number of old and new throws, we do not care
which throw each outcome belongs to. An alternative interpretation is the following:
if we knew enough about the circumstances of each throw, knowledge of all other
throws would be irrelevant in our plausibility assignment for that throw; but those
circumstances are unknown and we can only assign a plausibility distribution to their
various possibilities, which are themselves plausibility-indexed; see [?]. In
both interpretations the plausibility distribution for the outcomes of throws
$j_1, \dots, j_n$ has the form

$$P(R_{i_1}^{(j_1)} \land \dots \land R_{i_n}^{(j_n)} \mid I) = \int p_{i_1} \cdots p_{i_n}\, g(p \mid I)\, dp; \tag{4}$$

the integration is over the simplex of plausibility distributions

$$\Delta := \{p := (p_1, \dots, p_6) \mid p_i \ge 0,\ \textstyle\sum_i p_i = 1\}. \tag{5}$$

The generalized function [16-18] $g$ characterizes the exchangeable model chosen;
$g\,dp$ can be interpreted as the plausibility density for the limiting frequencies of the
outcomes as the number of new throws increases indefinitely or, in the alternative
interpretation of infinite exchangeability, as the plausibility density for those
circumstances with index values around the volume element $dp$. See app. C for some
remarks about this volume element.

The choice of $g(p \mid I)$ defines our exchangeable model $I$. For example, for a
generalized density concentrated on the uniform distribution,

$$g(p \mid I_{\mathrm{ft}})\, dp := \prod_i \delta(p_i - 1/6)\, dp, \tag{6}$$

the model $I_{\mathrm{ft}}$ gives each throw a uniform plausibility distribution, independent of the
knowledge about all other throws and identical for all throws ('i.i.d.'). It represents
the unshakeable belief that the die and throwing technique are absolutely 'fair'. We
call this the fair-throw model; it will be the first to be compared with the
maximum-entropy principle.

Suppose, on the other hand, that we judge the plausibility for face i on a new
throw to depend only on N and on how many old throws yield face i; in other
words, the frequencies of the other faces of old throws are irrelevant. Then we have
the Johnson model, also called Dirichlet model, $I_{\mathrm{J}}$. It depends on a parameter K and
is represented by the density

$$g(p \mid I_{\mathrm{J}})\, dp := \frac{\Gamma(6K)}{\Gamma(K)^6} \prod_i p_i^{K-1}\, dp, \qquad K > 0; \tag{7}$$

see Johnson [?], Good [?, ch. 4], Zabell [20], Jaynes [21], and references therein.
For the values K = 1, 2, 5, 50 and K large this will be the second model in our
numerical comparison. The case K = 1 (a constant density) corresponds to the
multidimensional form of Bayes' [22, Scholium] and Laplace's suggestion [23, p. xvii]: 'Quand
la probabilité d'un événement simple est inconnue, on peut lui supposer également
toutes les valeurs depuis zéro jusqu'à l'unité' ('When the probability of a simple event
is unknown, we may suppose it equally to have any value from zero to one'). See
Jaynes [24, ch. 11] and Stigler [25] for interesting discussions and references on this
case. The case K = 1/2, not discussed here, was advocated by Jeffreys [26].

The third model $I_{\mathrm{M}}$ in our comparison is defined by the suggestive density

$$g(p \mid I_{\mathrm{M}})\, dp := c(L)\, \frac{1}{\prod_i (L p_i)!}\, dp, \qquad L \ge 1; \tag{8}$$

with $c(L)$ a normalization factor and, here and in the following,

$$x! := \Gamma(x + 1). \tag{9}$$

I have been unable to characterize this model in more intuitive terms or to derive it
from particular assumptions as can be done for the Johnson one. This is an interesting
problem which deserves further study. Apart from the normalization, the expression
above is (intentionally, as we shall see later) identical in form with the multiplicity
factor (2) and for this reason I call this the multiplicity model with parameter L. We
shall use the values L = 1, 2, 5, 50 and L large. The case L = 1 (a generalized density
proportional to $\prod_i p_i^{-1}$) was proposed by Haldane [27] and is discussed by Jeffreys
[28, § 3.1, p. 123 ff.] and Jaynes [?, § VII]; see also Zellner [?, § 2.13].

As you have already guessed, this model has been chosen because for appropriate 
values of the parameter L it gives the same distributions as the maximum-entropy 
principle. 

Further mathematical properties of our models are discussed in app. B.



3 Main results 

The plausibilities assigned by the maximum-entropy principle and our three
exchangeable models in the fifteen problems are derived in app. B and presented in the
tables of app. A. The features that strike me most in these tables are the
following.
Cases with one or two old throws, i.e. N = 1 or 2, with an average a = 5: In
the first case we know for sure that the old throw must give ⚄, so its plausibility
distribution must be (0, 0, 0, 0, 100, 0) %. In the second case we know that the faces
⚀, ⚁, ⚂ cannot appear because the total sum of the two old throws must be 10; so the
distribution must have the form (0, 0, 0, · , · , · ). If we interpret the maximum-entropy
distribution as referring to old throws, it is in these cases not just unreasonable, but
plainly wrong. The maximum-entropy principle is therefore not meant to be used for
old throws when N is small.

All three exchangeable models give correct results instead. This was expected: in 
situations of certainty the plausibility calculus reduces to the truth-logical calculus. 

Cases with N = 1 or 2 and a = 6: We know that the one or two old throws give
⚅. This is certainly no ground to suppose that the throwing technique or the die
be completely biased towards the face ⚅; we are in fact excluding any particular
detailed background knowledge, as e.g. that all faces of the die have the same,
unknown, number of pips (so that in these cases that number is revealed to be 6). In
general, knowledge of the outcome of only one or two old throws should not make
our predictions for a new throw deviate very much from the uniform distribution. If
interpreted to refer to a new throw, the maximum-entropy distribution is therefore
quite unreasonable since it concentrates all plausibility on face ⚅. The same is true
for the N = 1 or 2, a = 5 cases. The maximum-entropy principle is therefore not
meant to be applied for new throws when N is small.

The Johnson and multiplicity models on the other hand give reasonably more 
uniform distributions for the new throw, especially for larger values of the parameters 
K and L. 

Note also how the exchangeable models respect the logical symmetries of these
cases: When N = 1 the cases a = 6 and a = 5 are completely symmetric under the
exchange of those faces (of all faces, indeed). When N = 2 and a = 6 our knowledge,
that only face ⚅ appears in old throws, is symmetric under exchange of all other
faces; this symmetry is respected by the exchangeable models in both old and new
throws. When N = 2 and a = 5 we know that the outcomes must be (⚃, ⚅), (⚄, ⚄), or
(⚅, ⚃); this symmetry under permutation of ⚃, ⚅, and of ⚀, ⚁, ⚂ is again respected
in the exchangeable models.
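The small-N cases can be reproduced exactly for the Johnson model. The sketch below assumes the standard Dirichlet-multinomial results: a set of counts $(n_1, \dots, n_6)$ has plausibility proportional to $N! \prod_i [\Gamma(K + n_i)/\Gamma(K)]/n_i!$, an old throw gives face i with plausibility $n_i/N$ given the counts, and a new throw gives it with plausibility $(K + n_i)/(6K + N)$. The function names are introduced here for illustration; the paper's own formulas are derived in app. B.

```python
from fractions import Fraction
from math import factorial

FACES = 6

def count_vectors(n_throws, faces=FACES):
    """Yield all tuples of non-negative counts (n_1, ..., n_faces) summing to n_throws."""
    if faces == 1:
        yield (n_throws,)
        return
    for first in range(n_throws + 1):
        for rest in count_vectors(n_throws - first, faces - 1):
            yield (first,) + rest

def rising(x, n):
    """Rising factorial x (x+1) ... (x+n-1), equal to Gamma(x+n) / Gamma(x)."""
    out = Fraction(1)
    for m in range(n):
        out *= x + m
    return out

def johnson(n_throws, total, K):
    """Old- and new-throw distributions when only the sum of the old throws is known."""
    K = Fraction(K)
    old = [Fraction(0)] * FACES
    new = [Fraction(0)] * FACES
    norm = Fraction(0)
    for counts in count_vectors(n_throws):
        if sum(face * n for face, n in enumerate(counts, start=1)) != total:
            continue  # these counts are incompatible with the known average
        # Unnormalized plausibility of the counts (factors common to all counts dropped).
        weight = Fraction(factorial(n_throws))
        for n in counts:
            weight *= rising(K, n) / factorial(n)
        norm += weight
        for i, n in enumerate(counts):
            old[i] += weight * Fraction(n, n_throws)
            new[i] += weight * (K + n) / (FACES * K + n_throws)
    if norm == 0:
        raise ValueError("contradictory data: no outcomes have this average")
    return [x / norm for x in old], [x / norm for x in new]

old, new = johnson(n_throws=2, total=10, K=1)  # N = 2, a = 5
print([float(x) for x in old])
print([float(x) for x in new])
```

For N = 1, a = 6, K = 1 the same function gives 1/7 for faces one to five and 2/7 for face six on a new throw, i.e. about 14.3 % and 28.6 %, in agreement with the table of app. A.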

Case with N = 1 (or N odd) and a = 7/2: An obviously impossible case. This is
reflected in the fact that plausibility logic yields the indeterminate form 0/0 for old and
new throws: any inference is completely arbitrary and unreliable because we have been
given contradictory data. Yet the maximum-entropy principle gives the uniform distribution.

We have concluded that the maximum-entropy principle is not meant to be ap- 
plied for either old or new throws when N is small. But how small is 'small'? Let us 
examine the examples with six old throws. 
Case N = 6, a = 6: All six old throws must have outcome ⚅. The maximum-entropy
principle and the exchangeable models give the correct distribution. Knowing
that six old throws give ⚅, which plausibilities would you assign for a new throw?
I should be surprised at the set of sixes, but should not conclude yet that the
throwing technique or the die be completely biased towards ⚅, as the maximum-entropy
principle suggests instead. So I find the latter's result unreasonable.

The multiplicity model with L = 50 gives in my opinion the most reasonable
distribution for a new throw. Also the Johnson model with K somewhere between 5 
and 50 gives a reasonable distribution. 

Case N = 6, a = 5: The maximum-entropy distribution is reasonable when
interpreted as a distribution for old throws, but I still prefer the Johnson and the
multiplicity models with K and L around 50 that give a few percent less to the ⚅ and ⚄
faces.

For a new throw, the maximum-entropy distribution is less reasonable; I find
that concentrating 75 % of the plausibility on ⚄ and ⚅ is too much. Again, the
multiplicity model with L = 50 or the Johnson with K between 5 and 50 give results
most reasonable for me.
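For reference, the maximum-entropy distribution discussed in these cases can be computed as follows. This is a minimal sketch under the usual reading of the principle, namely maximizing the Shannon entropy (3) subject to 'expectation = average'; the solution then has the exponential form $p_i \propto \exp(\lambda i)$, and the multiplier $\lambda$ is found here by bisection. The function name and the bisection bracket are choices made for this illustration.

```python
from math import exp

FACES = range(1, 7)

def maxent_die(a, tol=1e-12):
    """Maximum-Shannon-entropy distribution on faces 1..6 with expectation a.

    Valid for 1 < a < 6; at the endpoints all plausibility sits on one face.
    """
    def dist(lam):
        weights = [exp(lam * i) for i in FACES]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(i * p for i, p in zip(FACES, dist(lam)))

    # The mean increases monotonically with lam, so bisection applies.
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < a:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

p = maxent_die(5)
print([round(100 * x, 1) for x in p])   # percentages for faces 1..6
print(round(100 * (p[4] + p[5]), 1))    # share of faces five and six
```

Note that the result does not depend on N: the same distribution is returned whether the average comes from two throws or from two million, which is the point at issue in the comparison.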

So with an average based on six old throws the maximum-entropy principle still 
gives unsatisfactory answers. Let us look at the cases with twelve throws. 

Case N = 12, a = 6: We know again that all twelve old throws give ⚅; the
maximum-entropy principle and the three exchangeable models give the correct answer.
What about a new throw? I should start to believe that the die or throwing technique
be biased towards ⚅, but should still not be sure that they be completely biased.
So the maximum-entropy principle's answer is still unreasonable for me. I find the
Johnson and the multiplicity models with low parameter values, just above 1, more
reasonable, but unreasonable with higher parameter values (cf. § 4).

Case N = 12, a = 5: The maximum-entropy principle and all exchangeable
models give reasonable distributions for an old throw. For a new throw I find the
maximum-entropy distribution too concentrated on the faces ⚄ and ⚅; I prefer the
exchangeable models with lower parameter values.

So for the problems with N = 12 the maximum-entropy principle gives more 
reasonable, though not yet satisfactory, results. 

Let us consider very large values of N then. Here I find the maximum-entropy 
distributions reasonable, for both old and new throws. How do the exchangeable 
models compare? 

Case with N large, a = 6: The maximum-entropy principle as well as the Johnson
and multiplicity models say that the larger the number of old throws that give ⚅, the
larger the plausibility that the die or throwing technique are completely biased toward
face ⚅ for a new throw as well; all plausibility is asymptotically concentrated on that
face. This is what I should indeed believe.

Case with N large, a = 5: This case is the most interesting. As usual, when the
parameters K or L are much larger than N both the Johnson and multiplicity models
behave like the fair-throw one, as explained in app. B. For parameter values which
are large, but small compared to the number of old throws, the Johnson model gives a
distribution asymptotically equal to that of the maximum-entropy principle with the
Burg entropy [31], rather than the Shannon one.

On the other hand, the multiplicity model gives exactly the same distribution as
the usual maximum-(Shannon-)entropy principle. This result is extremely important
and holds for both old and new throws.

Cases with 1 < N ≤ 12, a = 7/2: In all these problems the maximum-entropy
principle and the exchangeable models for larger K and L give reasonable
distributions, for both old and new throws. Note how all exchangeable models including the
fair-throw one give, for old throws, slightly larger plausibilities to outcomes nearer ⚂
or ⚃. This is not strange, as some counting shows. For example, among the 146 sequences
of four-throw outcomes summing up to 14 the face ⚀ appears 21 times as first-throw
outcome whereas the face ⚂ appears 27 times.
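The counting can be checked by brute-force enumeration. The sketch below lists all sequences of four throws summing to 14 (N = 4, a = 7/2) and tallies the face shown by the first throw; under the fair-throw model every such sequence is equally plausible, so the old-throw distribution is simply these tallies divided by their total.

```python
from collections import Counter
from itertools import product

throws, total = 4, 14  # N = 4, a = 7/2

# All ordered outcome sequences of four throws whose faces sum to 14.
sequences = [s for s in product(range(1, 7), repeat=throws) if sum(s) == total]
first_face = Counter(s[0] for s in sequences)

print(len(sequences))              # 146
print(sorted(first_face.items()))  # [(1, 21), (2, 25), (3, 27), (4, 27), (5, 25), (6, 21)]
```

The tallies are symmetric about 7/2 and peak at faces three and four, which is the slight preference for central outcomes noted in the text.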

Ironically, the results of the exchangeable models have in many cases greater 
Shannon entropies than those of the maximum-entropy principle. If we regard the 
Shannon entropy as a measure of 'incertitude' in a plausibility distribution, then the 
exchangeable models give in those cases more 'uncertain' or, as Jaynes would say, 
less 'committing' answers than the maximum-entropy principle. This is not in con- 
tradiction with the principle, of course: in those cases, the exchangeable models do 
not satisfy the constraint that the average be equal to the expectation. See the discus- 
sion in the next section. 

4 Conclusions and various remarks 

Most of the conclusions I state here are personal, in the sense that they are not only 
a matter of logic but of taste as well. I invite you to peruse the numerical results 
presented in the tables of app. A and arrive at your own conclusions.

When the number of throws on which the average is calculated is small, the maxi- 
mum-entropy principle gives very unreasonable or even wrong results, depending on 
whether its distribution is interpreted as referring to new or old throws. The perfor- 
mance gets better the larger the numbers of throws, but with twelve of them I still 
find the maximum-entropy distributions unreasonable. 

On which grounds do I say 'reasonable' or 'unreasonable'? This is a very
important and interesting question. At first, my judgements were almost instinctive; they
came from background knowledge not immediately present to the mind (see Jeffreys'
colourful discussion on this point [28, § 3.1, pp. 123-124]). With some introspection,
I can say my grounds are these: I think it is very difficult to master such a throwing
technique or to construct such a regularly numbered die as would lead me to assign
a plausibility distribution remarkably different from the uniform one. Therefore I
choose, on the simplex of distributions, a density very peaked around the uniform
distribution. It is easy, on the other hand, to construct a die with two or more faces
showing the same number of pips. I should reflect this by superposing on my previous
density another one concentrated on the facets of the simplex. Thus, neither the Johnson
nor the multiplicity model represents my background knowledge exactly. I can also
add that if the 'die' were very irregular, e.g. with one edge length twice the other two, I
should complain of having been given false data, since I should not call that a 'die'.

The plausibility model I choose reflects my knowledge about dice-throwing. In 
a problem regarding something else, e.g. the energy of some physical system, my 
knowledge and hence the model used would probably be different. In some situations 
it would even be reasonable to use non-exchangeable models. 

The numerical comparison presented here shows that the use of plausibility logic
gives us more possibilities of 'fine-tuning', of better representing our background
knowledge, than the principle of maximum entropy. Enthusiasts for this principle
would probably argue that in its full generality it allows for finer tuning too: we can 
choose any convex region on the simplex of distributions as constraint, representing 
the posterior distributions we deem acceptable. But the full use of plausibility logic 
is even more flexible: first, we can give different plausibilities to different regions; 
second, the latter need not be convex. And, most importantly, plausibility logic does 
not ask us to choose amongst posterior distributions, but to carefully specify a prior 
one on an appropriate hypothesis space — it requires us to examine whence we start 
(including what question we are asking), not whither we want to arrive. And in 
inference problems this is always advisable, lest we let our wishes, instead of the 
facts, suggest what is more or less plausible. 

Advocates of the maximum-entropy principle may also contend that in the cases
in which this gives apparently wrong results (small N), it is because we have chosen
wrong or insufficient constraints. For instance, if we know that for N = 2 and a = 5 it
is logically impossible that some old throw give ⚀, ⚁, or ⚂, then we ought to impose
this as a constraint. I might accept this argument but still think that plausibility logic
is superior since it reveals these constraints to us, as logical consequences of the
situation, without requiring us to put them again explicitly into the theory. When I
first obtained the results for the N = 2, a = 5 case I was indeed surprised seeing that
all exchangeable models give distributions of the form (0, 0, 0, x, y, x) for old throws
and (z, z, z, x, y, x) for a new one. An analysis of the possible outcomes in this case
then showed why this must be logically so, as explained in § 3. Simple happenings
like this show the beauty and power of plausibility logic.

As regards constraints, we have seen that in many cases the rule 'expectation =
average' is not respected by the more reasonable exchangeable models, which can
therefore have greater entropy than the maximum-entropy distribution. This is because in
some cases that constraint rule is absurd. In general its range of applicability is
'subjective', a fact that is seldom stressed enough; consensus and thus objectivity
are only reached in limit cases. By 'subjective' I do not mean 'subject to personal
quirks', as de Finetti seems to imply sometimes [32], but 'perceptibly dependent on
almost imperceptible differences in background knowledge' (as a chaotic dynamics
on initial conditions).



Although our study only concerns a special example, it is easy to see how it can
be generalized to a general theorem: The plausibility distribution given by the
maximum-entropy principle for new observations is the same as that given by a particular
class of infinitely exchangeable models in a well-defined limit case. This was known,
apparently (see Skilling [33], Rodríguez [34-36], and references therein), but I
have never seen it said explicitly. I prefer not to say that the principle is 'derived'
from plausibility logic, for the former's formulation is based on different axioms and
primitives than the latter's, and these we have not derived. But the procedure to obtain
the maximum-entropy distribution can be justified by plausibility logic without the
need of those additional principles.

We need a characterization of the class of models from which the principle stems,
though. Interesting studies by Skilling, Rodríguez, Caticha & Preuss try to
characterize a model in this class by invoking the maximum-entropy principle again; more
about this below. It would be interesting to find a characterization of the
multiplicity model similar to that, mentioned in § 2.1, of the Johnson model; or in terms of
a symmetry on a particular hypothesis space. This characterization is also
important because one could try to generalize it to other hypothesis spaces beyond the
exchangeable-model one ('plausibilities of plausibility-indexed circumstances'). In
this way we could obtain, from plausibility logic, the form of an entropy for general
statistical models (in the sense of Mielnik and Holevo [?]); see e.g. the studies
by Band & Park [?], Slater [?], Porta Mana & Björk [?], and Barnum et al. [?].

The derivation of the maximum-entropy procedure from plausibility logic is also 
useful because it clearly shows in which situations the maximum-entropy principle 
can be applied, and provides reasonable results in those situations in which the prin- 
ciple's answers are unreasonable or plainly wrong. The way this is mathematically 
achieved is explained at the end of app. B. The derivation also makes clear that there
may be situations in which we can reasonably assign distributions different from 
those of the maximum-entropy principle. For example, if we had reasons to use a 
Johnson model in our inference, the conclusions of the maximum-entropy principle 
(with Shannon's entropy) would obviously be at variance with those reasons. 

The (large) parameter L of the multiplicity model gives the order of magnitude
of the number of old throws N at which the maximum-entropy principle begins to
approximate our conclusions, as mathematically explained in app. B. How should
the value of this parameter be chosen? The answer depends on the problem, as does
the choice of exchangeable (or non-exchangeable) model. In our dice examples
I should use a value of 50 or slightly larger. In a problem in which I could examine
the die and the throwing technique and get the impression that they are 'fair', the
value would be even higher.

What place in inference has the maximum-entropy principle then? As an 'update
rule' it seems superfluous, since plausibility logic already provides us with such a
rule, which we have seen to give more satisfactory results. Is it a procedure to assign
'prior' plausibilities, as Jaynes continually stressed? But then using average data
would be inappropriate, because they are the kind of data that can be used in Bayes'
rule instead. Is it a principle for selecting a prior distribution among those we deem
appropriate? Perhaps: we want a distribution that give highest plausibility to
such-and-such average; which to choose? Let's take that with maximum entropy. And
yet, it would be better to ask why we want that such-and-such average have highest
plausibility. If it is because we have observed that average in similar situations, why
not just apply plausibility logic and Bayes' theorem, as we have done in this study?
Jaynes [52, p. 27] states that the maximum-entropy principle 'is designed to cover
more general situations, where it does not make sense to speak of "trials"'. But
I have failed to find, even in Jaynes' writings, any examples of such 'more general
situations' or at least situations not involving some kind of repetitions of observations
('trials').

The mystery about the foundations of the maximum-entropy principle remains.
The motivation for the present study stemmed from my strictly personal opinion that any
'update rule' is (a) a special case of the general update rule of plausibility logic, or
(b) inconsistent, or (c) not an update rule; and that any procedure for assigning prior
plausibilities from some data is (a) a special case of a plausibility-logic updating
within a particular model, or (b) inconsistent, or (c) not a procedure for assigning
priors. I wanted to be sure that the maximum-entropy principle did not fall into the
(b) alternatives. My belief now is that the (a) alternatives hold; although there is a
small possibility that the (c) be right instead.

Any use of the maximum-entropy principle involving observational data usurps
the just and enlightened throne of plausibility logic. Therefore the only place suitable
for the principle seems to be, not in the choice of a distribution, but in the construction
of prior densities over a space of such distributions, where no observational data are
directly involved. Here, however, we have an infinite-dimensional simplex and the
principle requires the prior definition of a 'canonical' density [53, § 4.b] (which is
usually different from that discussed in app. C): we have a chicken-and-egg problem.
And what kind of constraints should be chosen for the principle, in such a space? The
studies on 'entropic priors' of Skilling [33; 54], Rodríguez, and Caticha
& Preuss [56; 57] are interesting in this respect, although they still leave me largely
unconvinced for the time being, for reasons that may be explained elsewhere.

A final note: the above discussion concerns the logical place and exact premises 
of the maximum-entropy principle (which are necessary also for pedagogical pur- 
poses), but does not affect its usefulness and efficacy, both witnessed by the vast 
range of applications and the number of books written about, and thanks to, this
principle. 

A Tables of results

Here are the distributions, for an old and a new throw, given by the maximum-entropy 
principle and the fair-throw, Johnson, and multiplicity models described in § 12.11 The 
values are rounded to one decimal place of a percentage; in many cases the rounded
values therefore do not sum exactly to 100%. The Shannon entropy of each distribution,
in nats, is also given within square brackets.

The general formulae used are derived in app. B, where mention is also made of
the routines used, when necessary, for numerical integration. See the same appendix
also for an explanation of the remarks appearing under some distributions.
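The ME rows of the tables can be checked independently of the appendix formulae. The following is only a minimal sketch, assuming the standard exponential form p_i ∝ exp(λ i) of the maximum-entropy solution under a mean constraint; the function names `maxent_die` and `shannon_entropy` and the bisection bounds are illustrative choices, not taken from the study.

```python
import math


def maxent_die(mean, faces=6, tol=1e-12):
    """Maximum-entropy distribution on faces 1..faces with a given mean.

    Assumes the exponential form p_i proportional to exp(lam * i); the
    multiplier lam is found by bisection, since the mean grows with lam.
    """
    if not 1 <= mean <= faces:
        raise ValueError("mean must lie between 1 and the number of faces")
    if mean in (1, faces):
        # Degenerate case: all plausibility on the extreme face.
        return [1.0 if i == mean else 0.0 for i in range(1, faces + 1)]

    def dist(lam):
        # Subtract the largest exponent to avoid overflow.
        top = max(lam, lam * faces)
        w = [math.exp(lam * i - top) for i in range(1, faces + 1)]
        z = sum(w)
        return [x / z for x in w]

    def mean_of(p):
        return sum(i * x for i, x in enumerate(p, start=1))

    lo, hi = -50.0, 50.0  # illustrative bracket for the multiplier
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_of(dist(mid)) < mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


def shannon_entropy(p):
    """Shannon entropy in nats, with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


p = maxent_die(5)
print([round(100 * x, 1) for x in p], round(shannon_entropy(p), 3))
```

For a = 5 the rounded percentages should agree with the ME rows of the a = 5 tables. The entropy of the unrounded distribution may differ slightly from a bracketed value if the latter was computed from the rounded entries.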



N = 1, a = 6

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (0, 0, 0, 0, 0, 100) [0]
fair-t. I_ft          (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (0, 0, 0, 0, 0, 100) [0]                        (14.3, 14.3, 14.3, 14.3, 14.3, 28.6) [1.749]
  K = 5               (0, 0, 0, 0, 0, 100) [0]                        (16.1, 16.1, 16.1, 16.1, 16.1, 19.4) [1.788]
  K = 50              (0, 0, 0, 0, 0, 100) [0]                        (16.6, 16.6, 16.6, 16.6, 16.6, 16.9) [1.791]
  K large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (0, 0, 0, 0, 0, 100) [0]                        (14.4, 14.4, 14.4, 14.4, 14.4, 28.2) [1.752]
  L = 5               (0, 0, 0, 0, 0, 100) [0]                        (14.9, 14.9, 14.9, 14.9, 14.9, 25.6) [1.767]
  L = 50              (0, 0, 0, 0, 0, 100) [0]                        (16.3, 16.3, 16.3, 16.3, 16.3, 18.3) [1.789]
  L large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a



N = 1, a = 5

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]
fair-t. I_ft          (0, 0, 0, 0, 100, 0) [0]                        uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (0, 0, 0, 0, 100, 0) [0]                        (14.3, 14.3, 14.3, 14.3, 28.6, 14.3) [1.749]
  K = 5               (0, 0, 0, 0, 100, 0) [0]                        (16.1, 16.1, 16.1, 16.1, 19.4, 16.1) [1.788]
  K = 50              (0, 0, 0, 0, 100, 0) [0]                        (16.6, 16.6, 16.6, 16.6, 16.9, 16.6) [1.791]
  K large             (0, 0, 0, 0, 100, 0) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (0, 0, 0, 0, 100, 0) [0]                        (14.4, 14.4, 14.4, 14.4, 28.2, 14.4) [1.752]
  L = 5               (0, 0, 0, 0, 100, 0) [0]                        (14.9, 14.9, 14.9, 14.9, 25.6, 14.9) [1.767]
  L = 50              (0, 0, 0, 0, 100, 0) [0]                        (16.3, 16.3, 16.3, 16.3, 18.3, 16.3) [1.789]
  L large             (0, 0, 0, 0, 100, 0) [0]                        uniform distribution irrespective of a
                                              like fair-throw model

N = 1, a = 7/2

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
all exch. models      undefined                                       undefined


N = 2, a = 6

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (0, 0, 0, 0, 0, 100) [0]
fair-t. I_ft          (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (0, 0, 0, 0, 0, 100) [0]                        (12.5, 12.5, 12.5, 12.5, 12.5, 37.5) [1.667]
  K = 5               (0, 0, 0, 0, 0, 100) [0]                        (15.6, 15.6, 15.6, 15.6, 15.6, 21.9) [1.782]
  K = 50              (0, 0, 0, 0, 0, 100) [0]                        (16.6, 16.6, 16.6, 16.6, 16.6, 17.2) [1.793]
  K large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (0, 0, 0, 0, 0, 100) [0]                        (12.6, 12.6, 12.6, 12.6, 12.6, 36.9) [1.672]
  L = 5               (0, 0, 0, 0, 0, 100) [0]                        (13.5, 13.5, 13.5, 13.5, 13.5, 32.3) [1.717]
  L = 50              (0, 0, 0, 0, 0, 100) [0]                        (16.0, 16.0, 16.0, 16.0, 16.0, 19.8) [1.787]
  L large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model




N = 2, a = 5

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]
fair-t. I_ft          (0, 0, 0, 33.3, 33.3, 33.3) [1.10]              uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (0, 0, 0, 25.0, 50.0, 25.0) [1.04]              (12.5, 12.5, 12.5, 18.8, 25.0, 18.8) [1.755]
  K = 5               (0, 0, 0, 31.2, 37.5, 31.2) [1.09]              (15.6, 15.6, 15.6, 17.6, 18.0, 17.6) [1.790]
  K = 50              (0, 0, 0, 33.1, 33.8, 33.1) [1.10]              (16.6, 16.6, 16.6, 16.8, 16.8, 16.8) [1.793]
  K large             (0, 0, 0, 33.3, 33.3, 33.3) [1.10]              uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (0, 0, 0, 25.4, 49.1, 25.4) [1.05]              (12.6, 12.6, 12.6, 18.8, 24.5, 18.8) [1.756]
  L = 5               (0, 0, 0, 26.9, 46.1, 26.9) [1.06]              (13.4, 13.4, 13.4, 18.8, 22.1, 18.8) [1.770]
  L = 50              (0, 0, 0, 32.0, 35.9, 32.0) [1.10]              (16.0, 16.0, 16.0, 17.3, 17.4, 17.3) [1.791]
  L large             (0, 0, 0, 33.3, 33.3, 33.3) [1.10]              uniform distribution irrespective of a
                                              like fair-throw model

N = 2, a = 7/2

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
fair-t. I_ft          (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
  K = 5               (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
  K = 50              (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
  K large             (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
  L = 5               (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
  L = 50              (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
  L large             (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    uniform distribution irrespective of a
                                              like fair-throw model



N = 6, a = 6

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (0, 0, 0, 0, 0, 100) [0]
fair-t. I_ft          (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (0, 0, 0, 0, 0, 100) [0]                        (8.3, 8.3, 8.3, 8.3, 8.3, 58.3) [1.347]
  K = 5               (0, 0, 0, 0, 0, 100) [0]                        (13.9, 13.9, 13.9, 13.9, 13.9, 30.6) [1.734]
  K = 50              (0, 0, 0, 0, 0, 100) [0]                        (16.3, 16.3, 16.3, 16.3, 16.3, 18.3) [1.789]
  K large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (0, 0, 0, 0, 0, 100) [0]                        (8.5, 8.5, 8.5, 8.5, 8.5, 57.5) [1.366]
  L = 5               (0, 0, 0, 0, 0, 100) [0]                        (10.0, 10.0, 10.0, 10.0, 10.0, 50.0) [1.498]
  L = 50              (0, 0, 0, 0, 0, 100) [0]                        (15.0, 15.0, 15.0, 15.0, 15.0, 24.8) [1.769]
  L large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model

N = 6, a = 5

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]
fair-t. I_ft          (1.1, 3.3, 7.7, 15.4, 27.6, 45.0) [1.362]       uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (1.7, 3.3, 6.7, 13.3, 31.7, 43.3) [1.358]       (9.2, 10.0, 11.7, 15.0, 24.2, 30.0) [1.690]
  K = 5               (1.4, 3.5, 7.3, 14.7, 27.5, 45.5) [1.363]       (14.1, 14.5, 15.1, 16.3, 18.5, 21.5) [1.780]
  K = 50              (1.1, 3.3, 7.6, 15.3, 27.6, 45.0) [1.360]       (16.4, 16.4, 16.5, 16.6, 16.9, 17.2) [1.792]
  K large             (1.1, 3.3, 7.7, 15.4, 27.6, 45.0) [1.362]       uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (1.7, 3.4, 6.7, 13.4, 31.2, 43.6) [1.360]       (9.2, 10.1, 11.8, 15.1, 23.8, 29.9) [1.691]
  L = 5               (1.6, 3.5, 6.9, 14.1, 29.0, 44.9) [1.363]       (10.2, 11.1, 12.6, 15.7, 21.8, 28.5) [1.718]
  L = 50              (1.3, 3.4, 7.4, 14.9, 27.5, 45.4) [1.361]       (15.0, 15.2, 15.7, 16.5, 17.9, 19.7) [1.787]
  L large             (1.1, 3.3, 7.7, 15.4, 27.6, 45.0) [1.362]       uniform distribution irrespective of a
                                              like fair-throw model



N = 6, a = 7/2

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
fair-t. I_ft          (15.0, 17.0, 18.0, 18.0, 17.0, 15.0) [1.789]    uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (13.0, 16.1, 20.8, 20.8, 16.1, 13.0) [1.772]    (14.8, 16.4, 18.8, 18.8, 16.4, 14.8) [1.787]
  K = 5               (14.5, 16.9, 18.6, 18.6, 16.9, 14.5) [1.787]    (16.3, 16.7, 17.0, 17.0, 16.7, 16.3) [1.792]
  K = 50              (15.0, 17.0, 18.1, 18.1, 17.0, 15.0) [1.790]    (16.6, 16.7, 16.7, 16.7, 16.7, 16.6) [1.792]
  K large             (15.0, 17.0, 18.0, 18.0, 17.0, 15.0) [1.789]    uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (13.1, 16.2, 20.7, 20.7, 16.2, 13.1) [1.774]    (14.9, 16.4, 18.7, 18.7, 16.4, 15.0) [1.788]
  L = 5               (13.5, 16.6, 20.0, 20.0, 16.5, 13.5) [1.780]    (15.4, 16.7, 18.0, 18.0, 16.6, 15.4) [1.791]
  L = 50              (14.7, 17.0, 18.3, 18.3, 17.0, 14.7) [1.788]    (16.5, 16.7, 16.8, 16.8, 16.7, 16.5) [1.792]
  L large             (15.0, 17.0, 18.0, 18.0, 17.0, 15.0) [1.789]    uniform distribution irrespective of a
                                              like fair-throw model

N = 12, a = 6

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (0, 0, 0, 0, 0, 100) [0]
fair-t. I_ft          (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (0, 0, 0, 0, 0, 100) [0]                        (5.6, 5.6, 5.6, 5.6, 5.6, 72.2) [1.042]
  K = 5               (0, 0, 0, 0, 0, 100) [0]                        (11.9, 11.9, 11.9, 11.9, 11.9, 40.5) [1.633]
  K = 50              (0, 0, 0, 0, 0, 100) [0]                        (16.0, 16.0, 16.0, 16.0, 16.0, 19.9) [1.787]
  K large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (0, 0, 0, 0, 0, 100) [0]                        (5.7, 5.7, 5.7, 5.7, 5.7, 71.6) [1.056]
  L = 5               (0, 0, 0, 0, 0, 100) [0]                        (7.0, 7.0, 7.0, 7.0, 7.0, 64.9) [1.211]
  L = 50              (0, 0, 0, 0, 0, 100) [0]                        (13.9, 13.9, 13.9, 13.9, 13.9, 30.7) [1.734]
  L large             (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model



N = 12, a = 5

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]
fair-t. I_ft          (1.6, 3.6, 7.4, 14.4, 26.6, 46.4) [1.366]       uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (2.7, 4.3, 6.6, 11.7, 26.6, 48.2) [1.367]       (7.3, 8.4, 9.9, 13.4, 23.3, 37.7) [1.605]
  K = 5               (2.2, 4.0, 7.2, 13.2, 25.2, 48.3) [1.368]       (12.5, 13.0, 14.0, 15.7, 19.1, 25.7) [1.756]
  K = 50              (1.7, 3.6, 7.4, 14.3, 26.3, 46.7) [1.367]       (16.1, 16.2, 16.3, 16.6, 17.0, 17.8) [1.791]
  K large             (1.6, 3.6, 7.4, 14.4, 26.6, 46.4) [1.366]       uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (2.6, 4.2, 6.6, 11.8, 26.4, 48.3) [1.363]       (7.4, 8.5, 10.0, 13.5, 23.1, 37.5) [1.609]
  L = 5               (2.5, 4.1, 6.8, 12.4, 25.6, 48.5) [1.365]       (8.1, 9.2, 10.9, 14.4, 22.1, 35.2) [1.645]
  L = 50              (1.9, 3.8, 7.3, 13.7, 25.8, 47.5) [1.366]       (13.7, 14.2, 14.9, 16.3, 18.6, 22.4) [1.777]
  L large             (1.6, 3.6, 7.4, 14.4, 26.6, 46.4) [1.366]       uniform distribution irrespective of a
                                              like fair-throw model

N = 12, a = 7/2

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
fair-t. I_ft          (15.9, 16.8, 17.3, 17.3, 16.8, 15.9) [1.791]    uniform distribution irrespective of a
Johnson I_J^K:
  K = 1               (13.5, 16.5, 20.0, 20.0, 16.5, 13.5) [1.779]    (14.5, 16.6, 18.9, 18.9, 16.6, 14.5) [1.786]
  K = 5               (15.3, 16.9, 17.9, 17.9, 16.9, 15.3) [1.791]    (16.3, 16.7, 17.0, 17.0, 16.7, 16.3) [1.792]
  K = 50              (15.8, 16.8, 17.4, 17.4, 16.8, 15.8) [1.791]    (16.6, 16.7, 16.7, 16.7, 16.7, 16.6) [1.792]
  K large             (15.9, 16.8, 17.3, 17.3, 16.8, 15.9) [1.791]    uniform distribution irrespective of a
                                              like fair-throw model
multiplicity I_m^L:
  L = 1               (13.5, 16.6, 19.9, 19.9, 16.6, 13.6) [1.780]    (14.6, 16.6, 18.7, 18.7, 16.6, 14.7) [1.786]
  L = 5               (14.1, 16.8, 19.1, 19.1, 16.8, 14.1) [1.784]    (15.2, 16.8, 18.0, 18.0, 16.7, 15.2) [1.789]
  L = 50              (15.5, 16.9, 17.6, 17.6, 16.9, 15.5) [1.790]    (16.5, 16.7, 16.8, 16.8, 16.7, 16.5) [1.792]
  L large             (15.9, 16.8, 17.3, 17.3, 16.8, 15.9) [1.791]    uniform distribution irrespective of a
                                              like fair-throw model



N large, a = 6

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (0, 0, 0, 0, 0, 100) [0]
fair-t. I_ft          (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                      ME distribution
Johnson I_J^K:
  K = 1               (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
  K = 5               (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
  K = 50              (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
  K large, N/K small  (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
  K large, N/K large  (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
                      ME distribution for Burg entropy                ME distribution for Burg entropy
multiplicity I_m^L:
  L = 1               (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
  L = 5               (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
  L = 50              (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
  L large, N/L small  (0, 0, 0, 0, 0, 100) [0]                        uniform distribution irrespective of a
                                              like fair-throw model
  L large, N/L large  (0, 0, 0, 0, 0, 100) [0]                        (0, 0, 0, 0, 0, 100) [0]
                      ME distribution                                 ME distribution




Porta Mana 



Plausibility logic and maximum entropy 



N large, a = 5

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]
fair-t. I_ft          (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]       uniform distribution irrespective of a
                      ME distribution
Johnson I_J^K:
  K = 1               (4.0, 5.0, 6.7, 10.0, 20.0, 54.3) [1.343]       (4.0, 5.0, 6.7, 10.0, 20.0, 54.3) [1.343]
  K = 5               (4.3, 5.3, 6.9, 9.8, 17.2, 56.5) [1.328]        (4.3, 5.3, 6.9, 9.8, 17.2, 56.5) [1.328]
  K = 50              (4.3, 5.3, 6.9, 9.8, 16.7, 56.9) [1.323]        (4.3, 5.3, 6.9, 9.8, 16.7, 56.9) [1.323]
  K large, N/K small  (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]       uniform distribution irrespective of a
                                              like fair-throw model
  K large, N/K large  (4.4, 5.3, 6.9, 9.8, 16.7, 57.0) [1.325]        (4.4, 5.3, 6.9, 9.8, 16.7, 57.0) [1.325]
                      ME distribution for Burg entropy                ME distribution for Burg entropy
multiplicity I_m^L:
  L = 1               (4.0, 5.0, 6.7, 10.1, 20.2, 54.1) [1.347]       (4.0, 5.0, 6.7, 10.1, 20.2, 54.1) [1.347]
  L = 5               (3.6, 4.7, 6.6, 10.7, 21.8, 52.5) [1.352]       (3.6, 4.7, 6.6, 10.7, 21.8, 52.5) [1.352]
  L = 50              (2.3, 3.8, 7.0, 13.4, 25.5, 48.0) [1.367]       (2.3, 3.8, 7.0, 13.4, 25.5, 48.0) [1.367]
  L large, N/L small  (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]       uniform distribution irrespective of a
                                              like fair-throw model
  L large, N/L large  (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]       (2.1, 3.9, 7.2, 13.6, 25.5, 47.8) [1.370]
                      ME distribution                                 ME distribution

N large, a = 3.5

model                 old throw, P(R_i^(1) | A_a^N ∧ I)/% [H/nat]     new throw, P(R_i^(0) | A_a^N ∧ I)/% [H/nat]

ME                                            (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
fair-t. I_ft          (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    uniform distribution irrespective of a
                      ME distribution
Johnson I_J^K:
  K = 1               (14.1, 16.6, 19.3, 19.3, 16.6, 14.1) [1.784]    (14.1, 16.6, 19.3, 19.3, 16.6, 14.1) [1.784]
  K = 5               (16.1, 16.8, 17.2, 17.2, 16.8, 16.1) [1.793]    (16.1, 16.8, 17.2, 17.2, 16.8, 16.1) [1.793]
  K = 50              (16.6, 16.7, 16.7, 16.7, 16.7, 16.6) [1.792]    (16.6, 16.7, 16.7, 16.7, 16.7, 16.6) [1.792]
  K large, N/K small  (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    uniform distribution irrespective of a
                                              like fair-throw model
  K large, N/K large  (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
                      ME distribution for Burg entropy                ME distribution for Burg entropy
multiplicity I_m^L:
  L = 1               (14.2, 16.6, 19.2, 19.2, 16.6, 14.2) [1.784]    (14.2, 16.6, 19.2, 19.2, 16.6, 14.2) [1.784]
  L = 5               (14.9, 16.8, 18.3, 18.3, 16.8, 14.9) [1.788]    (14.9, 16.8, 18.3, 18.3, 16.8, 14.9) [1.788]
  L = 50              (16.5, 16.7, 16.8, 16.8, 16.7, 16.5) [1.792]    (16.5, 16.7, 16.8, 16.8, 16.7, 16.5) [1.792]
  L large, N/L small  (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    uniform distribution irrespective of a
                                              like fair-throw model
  L large, N/L large  (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]    (16.7, 16.7, 16.7, 16.7, 16.7, 16.7) [1.793]
                      ME distribution                                 ME distribution

B Formulae

In § 2 we introduced the propositions $R_i^{(j)}$ stating that throw j shows face i. Throw
j = 1 is an 'old' throw, j = 0 a 'new' one, in the sense already explained in the
same section. The proposition I denotes the model used in our inferences and other
background knowledge.

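Let $F_{\boldsymbol N}$ denote the statement that the number of occurrences of the six possible
outcomes in N throws is $\boldsymbol N = (N_i)$. So $F_{\boldsymbol N}$ is a disjunction of conjunctions of Rs; e.g.,
for N = 3 and $\boldsymbol N = (0, 0, 0, 2, 1, 0)$ (two fours and one five),

$$F_{(0,0,0,2,1,0)} = \bigl(R_4^{(1)} \land R_4^{(2)} \land R_5^{(3)}\bigr) \lor \bigl(R_4^{(1)} \land R_5^{(2)} \land R_4^{(3)}\bigr) \lor \bigl(R_5^{(1)} \land R_4^{(2)} \land R_4^{(3)}\bigr). \tag{10}$$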

With a finitely or infinitely exchangeable model I for the old throws, the plausibility
of a set of outcomes depends on their frequencies but not on the throws in which they
occur; in our example,

$$P\bigl(R_4^{(1)} \land R_4^{(2)} \land R_5^{(3)} \mid I\bigr) = P\bigl(R_4^{(1)} \land R_5^{(2)} \land R_4^{(3)} \mid I\bigr) = P\bigl(R_5^{(1)} \land R_4^{(2)} \land R_4^{(3)} \mid I\bigr). \tag{11}$$

In this case it is a simple combinatorial exercise to show that for any old throw, e.g.
the first,

$$P\bigl(R_i^{(1)} \mid F_{\boldsymbol N} \land I\bigr) = N_i/N, \tag{12}$$

unless $F_{\boldsymbol N}$ and I be incompatible, i.e. $P(F_{\boldsymbol N} \mid I) = 0$, a case which we exclude.

If the model is infinitely exchangeable with density e(p\ I) dp we have by stan- 
dard combinatorial arguments [e.g.0, § 1.2; HI, § 4.3.2;M § 2] 

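$$P(F_{\boldsymbol N} \mid I) = \int N! \prod_i \frac{p_i^{N_i}}{N_i!}\; g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p. \tag{13}$$

Knowledge of the frequency of old throws leads to the 'updated' density

$$g(\boldsymbol p \mid F_{\boldsymbol N} \land I)\,\mathrm d\boldsymbol p = \frac{N! \prod_i \bigl(p_i^{N_i}/N_i!\bigr)\, g(\boldsymbol p \mid I)}{P(F_{\boldsymbol N} \mid I)}\,\mathrm d\boldsymbol p, \tag{14}$$

from which we obtain the plausibility distribution for a new throw:

$$P\bigl(R_i^{(0)} \mid F_{\boldsymbol N} \land I\bigr) = \int p_i\, g(\boldsymbol p \mid F_{\boldsymbol N} \land I)\,\mathrm d\boldsymbol p = \frac{\int p_i \bigl(\prod_j p_j^{N_j}\bigr)\, g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p}{\int \bigl(\prod_j p_j^{N_j}\bigr)\, g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p}. \tag{15}$$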

Recall that $A_a^N$ is the statement that in N throws the observed average is a. This
is obviously a statement about the possible outcome frequencies of old throws, and
we can write it as

$$\boldsymbol v \cdot \boldsymbol N = aN, \quad\text{with } \boldsymbol v := (1, 2, 3, 4, 5, 6). \tag{16}$$

Hence

$$A_a^N = \bigvee_{\boldsymbol N:\ \boldsymbol v \cdot \boldsymbol N = aN} F_{\boldsymbol N}; \tag{17}$$

e.g., if the observed average in two throws is 5/2, the two throws sum to 5 and

$$A_{5/2}^{2} = F_{(1,0,0,1,0,0)} \lor F_{(0,1,1,0,0,0)}. \tag{18}$$

Sums over frequencies constrained by the formula $\boldsymbol v \cdot \boldsymbol N = aN$ will for brevity be
denoted by $\sum_{\boldsymbol N}^{(a)}$.
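The constrained sums can be made concrete by listing the frequency vectors that satisfy the constraint of eq. (16). The following is a minimal sketch by brute-force enumeration; the function name `frequency_vectors` is hypothetical, and the enumeration is only practical for small N.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def frequency_vectors(n_throws, average, faces=6):
    """Yield the tuples (N_1, ..., N_faces) with sum_i N_i = n_throws and
    sum_i i * N_i = average * n_throws, i.e. the terms of the constrained sum."""
    target = Fraction(average) * n_throws
    # Each multiset of outcomes corresponds to exactly one frequency vector.
    for outcome in combinations_with_replacement(range(1, faces + 1), n_throws):
        if sum(outcome) == target:
            yield tuple(outcome.count(i) for i in range(1, faces + 1))


for freqs in frequency_vectors(2, Fraction(5, 2)):
    print(freqs)
```

The average is passed as a `Fraction` so that the comparison with the integer total is exact; an average that no integer total can produce simply yields no frequency vectors.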

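From eq. (17), any plausibility conditional on $A_a^N$ can be resolved into a weighted
sum of plausibilities conditional on the $F_{\boldsymbol N}$:

$$P(\,\cdot \mid A_a^N \land I) = \frac{\sum_{\boldsymbol N}^{(a)} P(\,\cdot \mid F_{\boldsymbol N} \land I)\; P(F_{\boldsymbol N} \mid I)}{\sum_{\boldsymbol N}^{(a)} P(F_{\boldsymbol N} \mid I)}. \tag{19}$$

Using formulae (12)-(15) and (19) we find

$$P\bigl(R_i^{(1)} \mid A_a^N \land I\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N} \int \prod_j \bigl(p_j^{N_j}/N_j!\bigr)\, g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p}{\sum_{\boldsymbol N}^{(a)} \int \prod_j \bigl(p_j^{N_j}/N_j!\bigr)\, g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p}, \tag{20a}$$

$$P\bigl(R_i^{(0)} \mid A_a^N \land I\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \int p_i \prod_j \bigl(p_j^{N_j}/N_j!\bigr)\, g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p}{\sum_{\boldsymbol N}^{(a)} \int \prod_j \bigl(p_j^{N_j}/N_j!\bigr)\, g(\boldsymbol p \mid I)\,\mathrm d\boldsymbol p}. \tag{20b}$$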

With these formulae we can compute the plausibilities required in our study. 

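Integration of these expressions is straightforward for the fair-throw model. We
obtain

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm{ft}}\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N}\, 6^{-N} \frac{N!}{\prod_j N_j!}}{\sum_{\boldsymbol N}^{(a)} 6^{-N} \frac{N!}{\prod_j N_j!}}, \tag{21a}$$

$$P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm{ft}}\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \frac16\, 6^{-N} \frac{N!}{\prod_j N_j!}}{\sum_{\boldsymbol N}^{(a)} 6^{-N} \frac{N!}{\prod_j N_j!}} = \frac16. \tag{21b}$$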

Note how the plausibility distribution for a new throw is independent of any
knowledge about old, or any other, throws. This model, like any other i.i.d.
one, does not allow us to 'learn from experience'.
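The old-throw distribution of the fair-throw model is a ratio of sums of multinomial coefficients and can be evaluated exactly for small N. The sketch below assumes that form (the common factor 6^{-N} cancels); the function names are illustrative and the brute-force enumeration is not meant for large N.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial


def frequency_vectors(n_throws, average, faces=6):
    """Frequency vectors compatible with the observed average (brute force)."""
    target = Fraction(average) * n_throws
    for outcome in combinations_with_replacement(range(1, faces + 1), n_throws):
        if sum(outcome) == target:
            yield tuple(outcome.count(i) for i in range(1, faces + 1))


def fair_throw_old(n_throws, average, faces=6):
    """Old-throw distribution under the fair-throw model: the mean of N_i / N
    weighted by the multinomial coefficient N! / prod_j N_j!."""
    num = [Fraction(0)] * faces
    den = Fraction(0)
    for freqs in frequency_vectors(n_throws, average, faces):
        mult = factorial(n_throws)
        for n in freqs:
            mult //= factorial(n)  # exact: each partial quotient is an integer
        den += mult
        for i, n in enumerate(freqs):
            num[i] += Fraction(n, n_throws) * mult
    if den == 0:
        raise ValueError("average incompatible with the number of throws")
    return [x / den for x in num]


print([round(100 * float(x), 1) for x in fair_throw_old(6, 5)])
```

Exact rational arithmetic is used so that the result can be compared with the tabulated percentages without rounding noise; the new-throw distribution needs no code, being uniform whatever the data.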

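By the relationship between Shannon's entropy and the multiplicity factor, for large
enough N the first plausibility above is approximated by

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm{ft}}\bigr) \approx \frac{\int_{\boldsymbol v \cdot \boldsymbol f = a} f_i \exp[N H(\boldsymbol f)]\,\mathrm d\boldsymbol f}{\int_{\boldsymbol v \cdot \boldsymbol f = a} \exp[N H(\boldsymbol f)]\,\mathrm d\boldsymbol f} \qquad (\text{with } f_i \equiv N_i/N)$$

$$\approx f_i \text{ which maximizes } H(\boldsymbol f) \text{ under the constraint } \boldsymbol v \cdot \boldsymbol f = a, \tag{22}$$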
where H is Shannon's entropy (3). This is the result of van Campenhout and Cover
il] and Csiszar |@]: for old throws, the maximum-entropy distribution is the asymp- 
totic distribution for an old throw conditional on the average of a large number of old 
throws, under the assumption of a uniform i.i.d. model. 
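The asymptotic step rests on the fact that the logarithm of the multiplicity factor N!/∏ N_i!, divided by N, approaches the Shannon entropy of the frequencies N_i/N. The following sketch illustrates this numerically; the counts used are hypothetical and the function names are illustrative.

```python
import math


def log_multiplicity(freqs):
    """Natural logarithm of N! / prod_i N_i!, computed via lgamma."""
    n = sum(freqs)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in freqs)


def shannon_entropy(p):
    """Shannon entropy in nats, with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


base = (1, 1, 1, 2, 3, 4)  # hypothetical counts, scaled up below
for scale in (1, 10, 100, 1000):
    freqs = [scale * k for k in base]
    n = sum(freqs)
    rate = log_multiplicity(freqs) / n
    entropy = shannon_entropy([k / n for k in freqs])
    print(n, round(rate, 4), round(entropy, 4))
```

As the counts are scaled up at fixed proportions, the two printed columns should approach each other, the difference decreasing roughly like (ln N)/N.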



In the case of the Johnson model, with density 

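$$g(\boldsymbol p \mid I_{\mathrm J}^K)\,\mathrm d\boldsymbol p = \Gamma(6K) \prod_i \frac{p_i^{K-1}}{\Gamma(K)}\,\mathrm d\boldsymbol p, \qquad K > 0,$$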

the integrals above can be reduced to Dirichlet's generalization of the beta integral 
M § 1.8], 



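$$\int \prod_i p_i^{b_i - 1}\,\mathrm d\boldsymbol p = \frac{\prod_i \Gamma(b_i)}{\Gamma\bigl(\sum_i b_i\bigr)}, \qquad \operatorname{Re} b_i > 0, \tag{23}$$

and after some simplifications we obtain the closed-form formulae

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm J}^K\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N} \prod_j \frac{(N_j + K - 1)!}{N_j!}}{\sum_{\boldsymbol N}^{(a)} \prod_j \frac{(N_j + K - 1)!}{N_j!}}, \tag{24a}$$

$$P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm J}^K\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i + K}{N + 6K} \prod_j \frac{(N_j + K - 1)!}{N_j!}}{\sum_{\boldsymbol N}^{(a)} \prod_j \frac{(N_j + K - 1)!}{N_j!}}. \tag{24b}$$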



Note how the plausibilities for old throws are weighted means of $N_i/N$, those for new
throws of $(N_i + K)/(N + 6K)$. The Johnson model behaves as if we knew about the existence of 6K additional
old throws in which each face occurred K times. Cf. Jaynes [21], Johnson 01, Zabell 

i2ch 
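The weighted-mean structure just described can be evaluated exactly for small N and integer K. The sketch below assumes weights ∏_j (N_j + K − 1)!/N_j! for each frequency vector, as in the closed-form formulae (24); the function names are hypothetical and non-integer K would need gamma functions instead of factorials.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial


def frequency_vectors(n_throws, average, faces=6):
    """Frequency vectors compatible with the observed average (brute force)."""
    target = Fraction(average) * n_throws
    for outcome in combinations_with_replacement(range(1, faces + 1), n_throws):
        if sum(outcome) == target:
            yield tuple(outcome.count(i) for i in range(1, faces + 1))


def johnson(n_throws, average, k, faces=6):
    """Return (old, new) distributions for the Johnson model, integer k >= 1.

    old: weighted mean of N_i / N; new: weighted mean of (N_i + k) / (N + faces*k).
    """
    old = [Fraction(0)] * faces
    new = [Fraction(0)] * faces
    den = Fraction(0)
    for freqs in frequency_vectors(n_throws, average, faces):
        weight = Fraction(1)
        for n in freqs:
            weight *= Fraction(factorial(n + k - 1), factorial(n))
        den += weight
        for i, n in enumerate(freqs):
            old[i] += Fraction(n, n_throws) * weight
            new[i] += Fraction(n + k, n_throws + faces * k) * weight
    if den == 0:
        raise ValueError("average incompatible with the number of throws")
    return [x / den for x in old], [x / den for x in new]


old, new = johnson(2, 5, 1)
print([round(100 * float(x), 1) for x in old])
print([round(100 * float(x), 1) for x in new])
```

For N = 2, a = 5 and K = 1 only two frequency vectors contribute, with equal weights, so the result can also be checked by hand against the corresponding table.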

If K is large we can use eq. (0 to show that the above formulae asymptotically 
become 



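$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm J}^K\bigr) \approx \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N} \prod_j \frac{1}{N_j!}}{\sum_{\boldsymbol N}^{(a)} \prod_j \frac{1}{N_j!}} = \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N}\, 6^{-N} \frac{N!}{\prod_j N_j!}}{\sum_{\boldsymbol N}^{(a)} 6^{-N} \frac{N!}{\prod_j N_j!}}, \tag{25a}$$

$$P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm J}^K\bigr) \approx \frac{\sum_{\boldsymbol N}^{(a)} \frac16 \prod_j \frac{1}{N_j!}}{\sum_{\boldsymbol N}^{(a)} \prod_j \frac{1}{N_j!}} = \frac{\sum_{\boldsymbol N}^{(a)} \frac16\, 6^{-N} \frac{N!}{\prod_j N_j!}}{\sum_{\boldsymbol N}^{(a)} 6^{-N} \frac{N!}{\prod_j N_j!}} = \frac16, \tag{25b}$$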



and the Johnson model is approximated by the fair-throw one. This is also true for N
large but sufficiently smaller than K.

When N is enough large and K finite eq. (O can again be used to show that 
expressions (I24ab are both approximated by the integral 



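$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm J}^K\bigr) \approx P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm J}^K\bigr) \approx \frac{\int_{\Delta_a} f_i \prod_j f_j^{K-1}\,\mathrm d\boldsymbol f}{\int_{\Delta_a} \prod_j f_j^{K-1}\,\mathrm d\boldsymbol f}, \tag{26}$$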



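where $\Delta_a$ is the intersection between the plausibility simplex $\Delta$ (5) and the constraint
hyperplane $\boldsymbol v \cdot \boldsymbol f = a$. If K is also large enough but sufficiently smaller than N, the integral
above becomes

$$\frac{\int_{\Delta_a} f_i \prod_j f_j^{K-1}\,\mathrm d\boldsymbol f}{\int_{\Delta_a} \prod_j f_j^{K-1}\,\mathrm d\boldsymbol f} \approx \frac{\int_{\Delta_a} f_i \exp[K H_{\mathrm B}(\boldsymbol f)]\,\mathrm d\boldsymbol f}{\int_{\Delta_a} \exp[K H_{\mathrm B}(\boldsymbol f)]\,\mathrm d\boldsymbol f}, \tag{27}$$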



where Hq is Burg's entropy A3 ill : 

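$$H_{\mathrm B}(\boldsymbol f) := \sum_i \ln f_i. \tag{28}$$

Then

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm J}^K\bigr) \approx P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm J}^K\bigr) \approx f_i \text{ which maximizes } H_{\mathrm B}(\boldsymbol f) \text{ under the constraint } \boldsymbol v \cdot \boldsymbol f = a. \tag{29}$$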

Thus the Johnson model yields the maximum-entropy principle with Burg's entropy; 
cf. Jaynes 0,p. 19]. Had we used a generalized Johnson model with density 



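$$g(\boldsymbol p \mid I_{\mathrm J}^{K,\boldsymbol m})\,\mathrm d\boldsymbol p = \Gamma(K) \prod_i \frac{p_i^{K m_i - 1}}{\Gamma(K m_i)}\,\mathrm d\boldsymbol p, \qquad K, m_i > 0,\ \sum_i m_i = 1, \tag{30}$$

the asymptotic expression (29) would have had the Kullback-Leibler divergence

$$-D(\boldsymbol m, \boldsymbol f) := -\sum_i m_i \ln(m_i/f_i) \tag{31}$$

in place of Burg's entropy $H_{\mathrm B}(\boldsymbol f)$.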
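The Burg-entropy maximizer under a mean constraint can be found with a one-dimensional search. The sketch below assumes the stationarity condition 1/f_i = α + β i; multiplying it by f_i and summing with both constraints gives α = n − β a for n faces, so f_i = 1/(n + β (i − a)) and only β remains to be fixed by normalization. The function name and the bisection bracket are illustrative, and the bracket may fail for means extremely close to the uniform value.

```python
def burg_maxent_die(mean, faces=6, iters=200):
    """Distribution maximizing sum_i ln f_i with sum_i f_i = 1 and given mean.

    Uses f_i = 1 / (faces + beta * (i - mean)); beta = 0 is the trivial
    (uniform) solution, so the search brackets the other, non-zero root.
    """
    if not 1 < mean < faces:
        raise ValueError("mean must lie strictly between 1 and faces")
    middle = (1 + faces) / 2
    if mean == middle:
        return [1.0 / faces] * faces
    if mean < middle:
        # Mirror symmetry i -> faces + 1 - i.
        return burg_maxent_die(faces + 1 - mean, faces, iters)[::-1]

    def dist(beta):
        return [1.0 / (faces + beta * (i - mean)) for i in range(1, faces + 1)]

    lo = -faces / (faces - mean) * (1 - 1e-12)  # just above the pole: sum > 1
    hi = -1e-9                                  # just below beta = 0: sum < 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(dist(mid)) > 1:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


f = burg_maxent_die(5)
print([round(100 * x, 1) for x in f])
```

Once the normalization holds with a non-zero β the mean constraint is satisfied automatically, and the face equal to the prescribed mean (when the mean is an integer) always receives plausibility 1/n. For a = 5 the output can be compared with the rows marked 'ME distribution for Burg entropy' in the tables.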

For the case in which both N and K are large and comparable with each other, 
see the following analogous discussion for the multiplicity model. 

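The integrals for the multiplicity model, with density

$$g(\boldsymbol p \mid I_{\mathrm m}^L)\,\mathrm d\boldsymbol p = c(L)\, \frac{L!}{\prod_i (L p_i)!}\,\mathrm d\boldsymbol p, \qquad L \ge 1,$$

are

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm m}^L\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N} \int \prod_j \frac{p_j^{N_j}}{N_j!\,(L p_j)!}\,\mathrm d\boldsymbol p}{\sum_{\boldsymbol N}^{(a)} \int \prod_j \frac{p_j^{N_j}}{N_j!\,(L p_j)!}\,\mathrm d\boldsymbol p}, \tag{32a}$$

$$P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm m}^L\bigr) = \frac{\sum_{\boldsymbol N}^{(a)} \int p_i \prod_j \frac{p_j^{N_j}}{N_j!\,(L p_j)!}\,\mathrm d\boldsymbol p}{\sum_{\boldsymbol N}^{(a)} \int \prod_j \frac{p_j^{N_j}}{N_j!\,(L p_j)!}\,\mathrm d\boldsymbol p}. \tag{32b}$$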
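
When L is large enough, the usual relationship between Shannon's entropy
and the multiplicity factor leads to the approximations

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm m}^L\bigr) \approx \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N} \int N! \prod_j \bigl(p_j^{N_j}/N_j!\bigr) \exp[L H(\boldsymbol p)]\,\mathrm d\boldsymbol p}{\sum_{\boldsymbol N}^{(a)} \int N! \prod_j \bigl(p_j^{N_j}/N_j!\bigr) \exp[L H(\boldsymbol p)]\,\mathrm d\boldsymbol p} \approx \frac{\sum_{\boldsymbol N}^{(a)} \frac{N_i}{N}\, 6^{-N} \frac{N!}{\prod_j N_j!}}{\sum_{\boldsymbol N}^{(a)} 6^{-N} \frac{N!}{\prod_j N_j!}}, \tag{33a}$$

$$P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm m}^L\bigr) \approx \frac{\sum_{\boldsymbol N}^{(a)} \int p_i\, N! \prod_j \bigl(p_j^{N_j}/N_j!\bigr) \exp[L H(\boldsymbol p)]\,\mathrm d\boldsymbol p}{\sum_{\boldsymbol N}^{(a)} \int N! \prod_j \bigl(p_j^{N_j}/N_j!\bigr) \exp[L H(\boldsymbol p)]\,\mathrm d\boldsymbol p} \approx \frac{\sum_{\boldsymbol N}^{(a)} \frac16\, 6^{-N} \frac{N!}{\prod_j N_j!}}{\sum_{\boldsymbol N}^{(a)} 6^{-N} \frac{N!}{\prod_j N_j!}} = \frac16; \tag{33b}$$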



thus the multiplicity model is approximated by the fair-throw one for large enough
values of L, also when N is large but smaller than L.

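What happens if N is large and L finite? As for the Johnson model, the integrals
can be approximated by their restriction to the hyperplane $\boldsymbol v \cdot \boldsymbol f = a$:

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm m}^L\bigr) \approx P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm m}^L\bigr) \approx \frac{\int_{\Delta_a} f_i \bigl[\prod_j (L f_j)!\bigr]^{-1}\,\mathrm d\boldsymbol f}{\int_{\Delta_a} \bigl[\prod_j (L f_j)!\bigr]^{-1}\,\mathrm d\boldsymbol f}. \tag{34}$$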

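Most interesting is the case in which L is also large but still smaller than N. As
usual, the relationship between Shannon's entropy and the multiplicity factor shows
that the integrals above are approximated by

$$\frac{\int_{\Delta_a} f_i \bigl[\prod_j (L f_j)!\bigr]^{-1}\,\mathrm d\boldsymbol f}{\int_{\Delta_a} \bigl[\prod_j (L f_j)!\bigr]^{-1}\,\mathrm d\boldsymbol f} \approx \frac{\int_{\Delta_a} f_i \exp[L H(\boldsymbol f)]\,\mathrm d\boldsymbol f}{\int_{\Delta_a} \exp[L H(\boldsymbol f)]\,\mathrm d\boldsymbol f}, \tag{35}$$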



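so that

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm m}^L\bigr) \approx P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm m}^L\bigr) \approx f_i \text{ which maximizes } H(\boldsymbol f) \text{ under the constraint } \boldsymbol v \cdot \boldsymbol f = a. \tag{36}$$

That is, for large N, L, and N/L, the distribution given by the multiplicity model for
old and new throws is equal to that of the maximum-entropy principle.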

It is easy to see why. For N large, and larger than L, the data make the posterior 
density of the model very peaked on the hyperplane determined by the average a. On 
this hyperplane, however, the posterior density is proportional to the prior one. For 
large values of the parameter L the density of the multiplicity model tends to have 
isopycnals (contour lines of same density, with respect to the canonical density dp) 
coinciding with the isentropes of the simplex of plausibility distributions, and very 
peaked on those of larger entropy. Hence the final distribution corresponds to that of 
larger entropy in the 'constraint' hyperplane determined by the average. 

The assigned distribution is thus, in general, determined by the competition be- 
tween two peaks: that of the prior density, centred on the uniform distribution, and 
that of the likelihood of the data, concentrated on the 'constraint' hyperplane. For
a fixed, large L and small N the first peak dominates and the assigned distribution 
is near the uniform one. As N increases the assigned distribution moves towards 
the constraint hyperplane; and when N becomes much larger than L it is practically 
on that hyperplane, though its exact position therein is still determined by the prior 
density. This is why the multiplicity model gives reasonable distributions for small 
N but can approximate the maximum-entropy distribution for large N. 

Of course this property is not exclusive to this model. Any other model with 
isopycnals coinciding with isentropes and very peaked on those of higher entropy 
will lead to the same distribution in problems with N large enough; e.g. a model with 
the 'entropy prior' of Skilling, Rodriguez, Caticha & Preuss discussed in § 4.

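Using the generalized multiplicity model defined by the density

$$g(\boldsymbol p \mid I_{\mathrm m}^{L,\boldsymbol m})\,\mathrm d\boldsymbol p = c(L, \boldsymbol m)\, L! \prod_i \frac{m_i^{L p_i}}{(L p_i)!}\,\mathrm d\boldsymbol p, \qquad L \ge 1,\ m_i > 0,\ \sum_i m_i = 1, \tag{37}$$

the asymptotic result (36) generalizes to the 'maximum-relative-entropy' principle,

$$P\bigl(R_i^{(1)} \mid A_a^N \land I_{\mathrm m}^{L,\boldsymbol m}\bigr) \approx P\bigl(R_i^{(0)} \mid A_a^N \land I_{\mathrm m}^{L,\boldsymbol m}\bigr) \approx f_i \text{ which minimizes } D(\boldsymbol f, \boldsymbol m) \text{ under the constraint } \boldsymbol v \cdot \boldsymbol f = a, \tag{38}$$

with the Kullback-Leibler divergence (31) instead of Shannon's entropy. Note that
the roles of $\boldsymbol f$ and $\boldsymbol m$ are interchanged with respect to the Johnson model's case.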

The integrals (26), (32), (34) were numerically calculated with both the Monte
Carlo routine Suave and the deterministic routine Cuhre of Hahn's multidimensional-
integration library Cuba [63], comparing the results of both routines to appraise
their mutual consistency and precision. In most cases Suave was fastest and most
precise.
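As a rough cross-check that does not require the Cuba library, the new-throw integral (32b) of the multiplicity model can be estimated by plain Monte Carlo. The sketch below is not the routine used in this study: it swaps in uniform sampling of the simplex (normalized exponential variates) for Suave and Cuhre, treats (L p_j)! as Γ(L p_j + 1), and relies on the constants c(L) L! and N! cancelling in the ratio. The function name, sample size, and seed are arbitrary illustrative choices.

```python
import math
import random


def multiplicity_new_throw_mc(freq_vectors, L, samples=100_000, seed=0):
    """Monte Carlo estimate of the new-throw distribution for the
    multiplicity model, given the frequency vectors compatible with the data."""
    rng = random.Random(seed)
    faces = len(freq_vectors[0])
    num = [0.0] * faces
    den = 0.0
    for _ in range(samples):
        # Uniform point of the simplex: normalized exponential variates.
        e = [rng.expovariate(1.0) for _ in range(faces)]
        total = sum(e)
        p = [x / total for x in e]
        prior = 1.0
        for x in p:
            prior /= math.gamma(L * x + 1.0)  # 1 / (L p_j)!
        likelihood = 0.0
        for freqs in freq_vectors:
            term = 1.0
            for x, n in zip(p, freqs):
                term *= x ** n / math.factorial(n)
            likelihood += term
        weight = prior * likelihood
        den += weight
        for i, x in enumerate(p):
            num[i] += x * weight
    return [v / den for v in num]


# One throw showing a six: the only compatible frequency vector.
estimate = multiplicity_new_throw_mc([(0, 0, 0, 0, 0, 1)], L=1)
print([round(100 * x, 1) for x in estimate])
```

The estimate carries statistical noise of the order of a tenth of a percentage point at this sample size, so only approximate agreement with the N = 1, a = 6, L = 1 row of the tables should be expected. For large L the call to `math.gamma` overflows, and logarithms (`math.lgamma`) would be needed instead.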



C The canonical density on a plausibility simplex: a 
whimsical definition 



Integrations over a plausibility simplex A of dimension n are usually written as 

n\\ ■•• f{ Pi )dp n . l ---dp 3 dp 2 dp l (39) 

Jo Jo Jo Jo 

or equivalently and more symmetrically as 

n\ [ f ••• f g(Pi)Kl-Y.Pi)dPn---dp2dp\ (40) 
Jo Jo Jo 

or even 

n\ \ ••• g( Pi )8(\-ZPi)dpn---dp 2 dp l . (41) 

Jo Jo Jo 

When g = 1 these integrals give the simplex a unit volume.

The first expression is unpleasant because it is asymmetric in the symmetric variables
$p_i$; the other two are unpleasant because they unnecessarily invoke a generalized
function. Behind these expressions there is a simple and well-behaved canonical den-
sity or volume element n! dp over the simplex, which gives the latter a unit volume.
By density I mean a twisted, positively oriented n-form. Twisted, or 'odd', because
its integration does not require an inner orientation of the simplex (see Schouten [64;
65, chs II, III], Burke [66; 67, ch. IV; 68]; also Marsden et al. [69, ch. 7] and Choquet-
Bruhat et al. [70, § IV.B.1]); in fact, choosing an inner orientation would break the
permutation invariance of the functions $p_i$.

This canonical density can be defined in two ways. 

First way. Any n-dimensional convex set that can be affinely mapped onto a finite
region of $\mathbb R^n$ by a map F can be given a canonical density $\omega$ by pulling back the
canonical density of $\mathbb R^n$ and rescaling:

$$\omega := \frac{F^*|\mathrm dx|}{\int F^*|\mathrm dx|}, \tag{42}$$

where $|\mathrm dx|$ is the canonical density (twisted, positively oriented n-form) of $\mathbb R^n$. The
rescaling gives the convex set a unit volume. It is easily proven that this definition is
independent of the affine map F chosen. When the convex set is a simplex, $\omega$ is the
canonical density n! dp.

The second way does not involve any embeddings, but uses instead the natu-
rally defined plausibility functions $p_i \colon \Delta \to \mathbb R$, which characterize the simplex as a
plausibility simplex. The canonical density is then implicitly defined as

$$n!\,\mathrm dp := \sum_{(i_1, \dots, i_{n-1}) \subset \{1, \dots, n\}} \bigl|\mathrm dp_{i_1} \land \dots \land \mathrm dp_{i_{n-1}}\bigr|, \tag{43}$$

the indices running over all permutations of n - 1 elements of {1, . . . , n}. The magni-
tude operator | · | transforms any density into the twisted, positively oriented equipol-
lent one. The expression above is symmetric in the $p_i$ and does not involve general-
ized functions.

For other densities and metric structures on a plausibility simplex see e.g. Amari
& Nagaoka [71].



And measure theory? 'La teoria della misura sta alla probabilità come lo stucco



messo male sta alle pareti: prima o poi cade', said Gian-Carlo Rota 1172147411 . 
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